#Introduction Unbanked individuals represent both an underserved demographic as well as a typically untapped market by reputable creditors. Home Credit seeks to fill this gap in service. There are unique challenges that accompany establishing creditworthiness among a population that by definition has little to no financial history, verifiable assets, or traditional means to qualify for a loan. ## Project Scope This project will utilize machine learning algorithms to develop a classification model which will use available data about Home Credit customers to improve prediction of those that are likely to repay loans granted by Home Credit. The team will test a number of possible classification models in order to develop the most accurate model on data outside the training data. An added benefit of this project is the potential to identify possible additional data that might further inform the classification model. A successful model will provide greater predictive power than a simple prediction based upon majority class statistics and will allow Home Credit to loan to customers with confidence that repayment will in return grow available assets to the company in order to further its mission of providing credit to the underserved.
## 'data.frame': 307511 obs. of 122 variables:
## $ SK_ID_CURR : int 100002 100003 100004 100006 100007 100008 100009 100010 100011 100012 ...
## $ TARGET : int 1 0 0 0 0 0 0 0 0 0 ...
## $ NAME_CONTRACT_TYPE : chr "Cash loans" "Cash loans" "Revolving loans" "Cash loans" ...
## $ CODE_GENDER : chr "M" "F" "M" "F" ...
## $ FLAG_OWN_CAR : chr "N" "N" "Y" "N" ...
## $ FLAG_OWN_REALTY : chr "Y" "N" "Y" "Y" ...
## $ CNT_CHILDREN : int 0 0 0 0 0 0 1 0 0 0 ...
## $ AMT_INCOME_TOTAL : num 202500 270000 67500 135000 121500 ...
## $ AMT_CREDIT : num 406598 1293503 135000 312683 513000 ...
## $ AMT_ANNUITY : num 24701 35699 6750 29687 21866 ...
## $ AMT_GOODS_PRICE : num 351000 1129500 135000 297000 513000 ...
## $ NAME_TYPE_SUITE : chr "Unaccompanied" "Family" "Unaccompanied" "Unaccompanied" ...
## $ NAME_INCOME_TYPE : chr "Working" "State servant" "Working" "Working" ...
## $ NAME_EDUCATION_TYPE : chr "Secondary / secondary special" "Higher education" "Secondary / secondary special" "Secondary / secondary special" ...
## $ NAME_FAMILY_STATUS : chr "Single / not married" "Married" "Single / not married" "Civil marriage" ...
## $ NAME_HOUSING_TYPE : chr "House / apartment" "House / apartment" "House / apartment" "House / apartment" ...
## $ REGION_POPULATION_RELATIVE : num 0.0188 0.00354 0.01003 0.00802 0.02866 ...
## $ DAYS_BIRTH : int -9461 -16765 -19046 -19005 -19932 -16941 -13778 -18850 -20099 -14469 ...
## $ DAYS_EMPLOYED : int -637 -1188 -225 -3039 -3038 -1588 -3130 -449 365243 -2019 ...
## $ DAYS_REGISTRATION : num -3648 -1186 -4260 -9833 -4311 ...
## $ DAYS_ID_PUBLISH : int -2120 -291 -2531 -2437 -3458 -477 -619 -2379 -3514 -3992 ...
## $ OWN_CAR_AGE : num NA NA 26 NA NA NA 17 8 NA NA ...
## $ FLAG_MOBIL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_EMP_PHONE : int 1 1 1 1 1 1 1 1 0 1 ...
## $ FLAG_WORK_PHONE : int 0 0 1 0 0 1 0 1 0 0 ...
## $ FLAG_CONT_MOBILE : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FLAG_PHONE : int 1 1 1 0 0 1 1 0 0 0 ...
## $ FLAG_EMAIL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ OCCUPATION_TYPE : chr "Laborers" "Core staff" "Laborers" "Laborers" ...
## $ CNT_FAM_MEMBERS : num 1 2 1 2 1 2 3 2 2 1 ...
## $ REGION_RATING_CLIENT : int 2 1 2 2 2 2 2 3 2 2 ...
## $ REGION_RATING_CLIENT_W_CITY : int 2 1 2 2 2 2 2 3 2 2 ...
## $ WEEKDAY_APPR_PROCESS_START : chr "WEDNESDAY" "MONDAY" "MONDAY" "WEDNESDAY" ...
## $ HOUR_APPR_PROCESS_START : int 10 11 9 17 11 16 16 16 14 8 ...
## $ REG_REGION_NOT_LIVE_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LIVE_REGION_NOT_WORK_REGION : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_LIVE_CITY : int 0 0 0 0 0 0 0 0 0 0 ...
## $ REG_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ LIVE_CITY_NOT_WORK_CITY : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ORGANIZATION_TYPE : chr "Business Entity Type 3" "School" "Government" "Business Entity Type 3" ...
## $ EXT_SOURCE_1 : num 0.083 0.311 NA NA NA ...
## $ EXT_SOURCE_2 : num 0.263 0.622 0.556 0.65 0.323 ...
## $ EXT_SOURCE_3 : num 0.139 NA 0.73 NA NA ...
## $ APARTMENTS_AVG : num 0.0247 0.0959 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_AVG : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_AVG : num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_AVG : num 0.619 0.796 NA NA NA ...
## $ COMMONAREA_AVG : num 0.0143 0.0605 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_AVG : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_AVG : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_AVG : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_AVG : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_AVG : num 0.0369 0.013 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_AVG : num 0.0202 0.0773 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_AVG : num 0.019 0.0549 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_AVG : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_AVG : num 0 0.0098 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MODE : num 0.0252 0.0924 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MODE : num 0.0383 0.0538 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MODE: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MODE : num 0.634 0.804 NA NA NA ...
## $ COMMONAREA_MODE : num 0.0144 0.0497 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MODE : num 0 0.0806 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MODE : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MODE : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MODE : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MODE : num 0.0377 0.0128 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MODE : num 0.022 0.079 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MODE : num 0.0198 0.0554 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MODE : num 0 0 NA NA NA NA NA NA NA NA ...
## $ APARTMENTS_MEDI : num 0.025 0.0968 NA NA NA NA NA NA NA NA ...
## $ BASEMENTAREA_MEDI : num 0.0369 0.0529 NA NA NA NA NA NA NA NA ...
## $ YEARS_BEGINEXPLUATATION_MEDI: num 0.972 0.985 NA NA NA ...
## $ YEARS_BUILD_MEDI : num 0.624 0.799 NA NA NA ...
## $ COMMONAREA_MEDI : num 0.0144 0.0608 NA NA NA NA NA NA NA NA ...
## $ ELEVATORS_MEDI : num 0 0.08 NA NA NA NA NA NA NA NA ...
## $ ENTRANCES_MEDI : num 0.069 0.0345 NA NA NA NA NA NA NA NA ...
## $ FLOORSMAX_MEDI : num 0.0833 0.2917 NA NA NA ...
## $ FLOORSMIN_MEDI : num 0.125 0.333 NA NA NA ...
## $ LANDAREA_MEDI : num 0.0375 0.0132 NA NA NA NA NA NA NA NA ...
## $ LIVINGAPARTMENTS_MEDI : num 0.0205 0.0787 NA NA NA NA NA NA NA NA ...
## $ LIVINGAREA_MEDI : num 0.0193 0.0558 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAPARTMENTS_MEDI : num 0 0.0039 NA NA NA NA NA NA NA NA ...
## $ NONLIVINGAREA_MEDI : num 0 0.01 NA NA NA NA NA NA NA NA ...
## $ FONDKAPREMONT_MODE : chr "reg oper account" "reg oper account" "" "" ...
## $ HOUSETYPE_MODE : chr "block of flats" "block of flats" "" "" ...
## $ TOTALAREA_MODE : num 0.0149 0.0714 NA NA NA NA NA NA NA NA ...
## $ WALLSMATERIAL_MODE : chr "Stone, brick" "Block" "" "" ...
## $ EMERGENCYSTATE_MODE : chr "No" "No" "" "" ...
## $ OBS_30_CNT_SOCIAL_CIRCLE : num 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_30_CNT_SOCIAL_CIRCLE : num 2 0 0 0 0 0 0 0 0 0 ...
## $ OBS_60_CNT_SOCIAL_CIRCLE : num 2 1 0 2 0 0 1 2 1 2 ...
## $ DEF_60_CNT_SOCIAL_CIRCLE : num 2 0 0 0 0 0 0 0 0 0 ...
## $ DAYS_LAST_PHONE_CHANGE : num -1134 -828 -815 -617 -1106 ...
## $ FLAG_DOCUMENT_2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ FLAG_DOCUMENT_3 : int 1 1 0 1 0 1 0 1 1 0 ...
## $ FLAG_DOCUMENT_4 : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.08073 0.00000 1.00000
There are 307511 rows and 122 features including TARGET variable.The target is categorical variable classified as “0” for the loan was repaid on time with no difficulties and “1” indicating the client had payment difficulties.
#Checking missing value in target variable
missing_count <- sum(is.na(app_train$TARGET))
missing_count## [1] 0
There are no missing values in Target variable
##
## 0 1
## 282686 24825
Here 24825 clients are observed to have payment difficulties, whereas 282686 people have no payment difficulties.
library(ggplot2)
ggplot(app_train, aes(x = as.factor(TARGET))) +
geom_bar(fill = "steelblue") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, color = "black", size = 3) +
labs(x = "Target Variable", y = "Count") +
ggtitle("Distribution of the Target Variable") +
scale_y_continuous(breaks = seq(0, max(table(app_train$TARGET)), by = 50000))
Above Target variable distribution shows that majority clients come
under no payment difficulties class(0), only 24825 clients have payment
difficulties.
# View the proportion of target variable
proportion <- prop.table(table(app_train$TARGET)) * 100
proportion##
## 0 1
## 91.927118 8.072882
91.92% of the clients are observed to have no payment difficulties, 8.07% clients face payment difficulties.The data looks unbalaced with respect target variable as proportion for class 0 is much higher (91.92%) than that of class 1 (8.07%). Special attention should also be given to model Sensitivity and Specificity.
# Calculate the proportion and convert it to percentages
proportion <- prop.table(table(app_train$TARGET)) * 100
# Create a data frame for plotting
pie_data <- data.frame(category = names(proportion), proportion = proportion)
# Create the pie chart using ggplot
pie_chart <- ggplot(pie_data, aes(x = "", y = proportion, fill = category)) +
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void() +
labs(fill = "Category") +
geom_text(aes(label = paste0(round(proportion, 2), "%")), position = position_stack(vjust = 0.5)) +
labs(title = "Proportion of TARGET Categories")
# Display the pie chart
print(pie_chart)# Count of majority class
majority_count <- max(table(app_train$TARGET))
# Total number of clients in the target class
total_count <- length(app_train$TARGET)
# Accuracy for majority class
accuracy <- majority_count / total_count
accuracy## [1] 0.9192712
The accuracy for majority class classifier “0” (client with no payment difficulties) is 91.92% which is same as the proportion of class “0” clients in the data.
#Getting list of Top5 positive and top 5 negative predictors
# Separate numeric predictors
numeric_vars <- app_train %>%
select_if(is.numeric)
# Calculate correlation between numeric predictors and target variable
cor_target <- cor(numeric_vars, numeric_vars$TARGET, use = "pairwise.complete.obs", method = "pearson")
cor_target## [,1]
## SK_ID_CURR -0.0021084690
## TARGET 1.0000000000
## CNT_CHILDREN 0.0191871336
## AMT_INCOME_TOTAL -0.0039818656
## AMT_CREDIT -0.0303692865
## AMT_ANNUITY -0.0128165615
## AMT_GOODS_PRICE -0.0396452812
## REGION_POPULATION_RELATIVE -0.0372271485
## DAYS_BIRTH 0.0782393083
## DAYS_EMPLOYED -0.0449316627
## DAYS_REGISTRATION 0.0419748628
## DAYS_ID_PUBLISH 0.0514571726
## OWN_CAR_AGE 0.0376115643
## FLAG_MOBIL 0.0005343956
## FLAG_EMP_PHONE 0.0459822197
## FLAG_WORK_PHONE 0.0285243224
## FLAG_CONT_MOBILE 0.0003701268
## FLAG_PHONE -0.0238062723
## FLAG_EMAIL -0.0017583834
## CNT_FAM_MEMBERS 0.0093077844
## REGION_RATING_CLIENT 0.0588990149
## REGION_RATING_CLIENT_W_CITY 0.0608926676
## HOUR_APPR_PROCESS_START -0.0241658314
## REG_REGION_NOT_LIVE_REGION 0.0055759445
## REG_REGION_NOT_WORK_REGION 0.0069419075
## LIVE_REGION_NOT_WORK_REGION 0.0028194792
## REG_CITY_NOT_LIVE_CITY 0.0443953748
## REG_CITY_NOT_WORK_CITY 0.0509944644
## LIVE_CITY_NOT_WORK_CITY 0.0325183411
## EXT_SOURCE_1 -0.1553171261
## EXT_SOURCE_2 -0.1604716716
## EXT_SOURCE_3 -0.1789186976
## APARTMENTS_AVG -0.0294975646
## BASEMENTAREA_AVG -0.0227457432
## YEARS_BEGINEXPLUATATION_AVG -0.0097276733
## YEARS_BUILD_AVG -0.0221492841
## COMMONAREA_AVG -0.0185496616
## ELEVATORS_AVG -0.0341987937
## ENTRANCES_AVG -0.0191721820
## FLOORSMAX_AVG -0.0440033705
## FLOORSMIN_AVG -0.0336135033
## LANDAREA_AVG -0.0108848227
## LIVINGAPARTMENTS_AVG -0.0250305333
## LIVINGAREA_AVG -0.0329971175
## NONLIVINGAPARTMENTS_AVG -0.0031761075
## NONLIVINGAREA_AVG -0.0135780698
## APARTMENTS_MODE -0.0272838707
## BASEMENTAREA_MODE -0.0199522839
## YEARS_BEGINEXPLUATATION_MODE -0.0090364474
## YEARS_BUILD_MODE -0.0220682038
## COMMONAREA_MODE -0.0163403373
## ELEVATORS_MODE -0.0321311710
## ENTRANCES_MODE -0.0173874223
## FLOORSMAX_MODE -0.0432262632
## FLOORSMIN_MODE -0.0326978253
## LANDAREA_MODE -0.0101741036
## LIVINGAPARTMENTS_MODE -0.0233932456
## LIVINGAREA_MODE -0.0306846158
## NONLIVINGAPARTMENTS_MODE -0.0015565608
## NONLIVINGAREA_MODE -0.0127105437
## APARTMENTS_MEDI -0.0291837589
## BASEMENTAREA_MEDI -0.0220812614
## YEARS_BEGINEXPLUATATION_MEDI -0.0099930966
## YEARS_BUILD_MEDI -0.0223259265
## COMMONAREA_MEDI -0.0185728689
## ELEVATORS_MEDI -0.0338628768
## ENTRANCES_MEDI -0.0190247563
## FLOORSMAX_MEDI -0.0437679210
## FLOORSMIN_MEDI -0.0333942872
## LANDAREA_MEDI -0.0112558266
## LIVINGAPARTMENTS_MEDI -0.0246206636
## LIVINGAREA_MEDI -0.0327392844
## NONLIVINGAPARTMENTS_MEDI -0.0027571486
## NONLIVINGAREA_MEDI -0.0133367200
## TOTALAREA_MODE -0.0325955468
## OBS_30_CNT_SOCIAL_CIRCLE 0.0091306657
## DEF_30_CNT_SOCIAL_CIRCLE 0.0322475793
## OBS_60_CNT_SOCIAL_CIRCLE 0.0090221436
## DEF_60_CNT_SOCIAL_CIRCLE 0.0312764721
## DAYS_LAST_PHONE_CHANGE 0.0552184835
## FLAG_DOCUMENT_2 0.0054171443
## FLAG_DOCUMENT_3 0.0443463469
## FLAG_DOCUMENT_4 -0.0026720822
## FLAG_DOCUMENT_5 -0.0003157774
## FLAG_DOCUMENT_6 -0.0286018940
## FLAG_DOCUMENT_7 -0.0015195032
## FLAG_DOCUMENT_8 -0.0080403845
## FLAG_DOCUMENT_9 -0.0043524086
## FLAG_DOCUMENT_10 -0.0014138916
## FLAG_DOCUMENT_11 -0.0042293497
## FLAG_DOCUMENT_12 -0.0007557507
## FLAG_DOCUMENT_13 -0.0115832223
## FLAG_DOCUMENT_14 -0.0094638219
## FLAG_DOCUMENT_15 -0.0065356576
## FLAG_DOCUMENT_16 -0.0116146716
## FLAG_DOCUMENT_17 -0.0033775691
## FLAG_DOCUMENT_18 -0.0079523851
## FLAG_DOCUMENT_19 -0.0013575183
## FLAG_DOCUMENT_20 0.0002153936
## FLAG_DOCUMENT_21 0.0037086250
## AMT_REQ_CREDIT_BUREAU_HOUR 0.0009304246
## AMT_REQ_CREDIT_BUREAU_DAY 0.0027044014
## AMT_REQ_CREDIT_BUREAU_WEEK 0.0007877117
## AMT_REQ_CREDIT_BUREAU_MON -0.0124624192
## AMT_REQ_CREDIT_BUREAU_QRT -0.0020219274
## AMT_REQ_CREDIT_BUREAU_YEAR 0.0199298586
# Create a data frame with predictor names and correlations
df <- data.frame(Target = "TARGET", Variable = colnames(numeric_vars), Correlation = cor_target)
# Sort the data frame by correlation in descending order
sorted_df <- df[order(-abs(df$Correlation)), , drop = FALSE]
# Extract the top 10 positive and top 10 negative correlations
top_pos <- head(sorted_df[sorted_df$Correlation > 0, ], 10)
top_neg <- head(sorted_df[sorted_df$Correlation < 0, ], 10)##Visualizing correlation using heatmap for numerical predictors with target
# Convert the correlation matrix to a data frame
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
cor_df <- melt(cor_matrix)
# Rename the columns
colnames(cor_df) <- c("Variable1", "Variable2", "Correlation")
# Sort the correlations by the absolute value
cor_df <- cor_df[order(-abs(cor_df$Correlation)), ]
# Filter the correlations involving the target variable
cor_target <- subset(cor_df, Variable1 == "TARGET" | Variable2 == "TARGET")
# Select the top positive and negative correlations
top_pos <- head(subset(cor_target, Correlation > 0), 5)
top_neg <- head(subset(cor_target, Correlation < 0), 5)
# Combine the top positive and negative correlations
top_cor <- rbind(top_pos, top_neg)
# Filter the correlations involving the top positive and negative predictors
top_predictors <- c(top_pos$Variable1, top_pos$Variable2, top_neg$Variable1, top_neg$Variable2)
cor_top_predictors <- subset(cor_df, Variable1 %in% top_predictors & Variable2 %in% top_predictors)
# Visualize the correlations using a heatmap
ggplot(cor_top_predictors, aes(x = Variable1, y = Variable2, fill = Correlation)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
geom_text(aes(label = round(Correlation, 2)), color = "black") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap of Top Predictors")
DAYS_BIRTH, REGION_RATING_CLIENT_W_CITY, REGION_RATING_CLIENT are the
top 3 numerical predictors with positive correlation wrt TARGET, whereas
EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1 are top 3 numerical predictors
with negative correlation wrt TARGET.
# Select character variables
character_vars <- app_train %>%
select_if(is.character)
# Create a data frame to store correlation values
cor_df <- data.frame(variable = character(), correlation = numeric(), stringsAsFactors = FALSE)
# Calculate correlation with the target variable
for (var_name in colnames(character_vars)) {
var <- character_vars[[var_name]]
correlation <- cor(as.numeric(as.factor(var)), app_train$TARGET, use = "pairwise.complete.obs")
cor_df <- cor_df %>% add_row(variable = var_name, correlation = correlation)
}## Listing top categorical predictors
# Sort the data frame by correlation in descending order
sorted_df <- cor_df %>% arrange(desc(correlation))
# Select the top 5 categorical predictors
top_categorical_predictors <- head(sorted_df, 10)
# Print the top 5 categorical predictors
print(top_categorical_predictors)## variable correlation
## 1 NAME_EDUCATION_TYPE 0.054698602
## 2 CODE_GENDER 0.054692262
## 3 NAME_INCOME_TYPE 0.046829434
## 4 OCCUPATION_TYPE 0.041419035
## 5 NAME_HOUSING_TYPE 0.034488603
## 6 NAME_TYPE_SUITE 0.009692820
## 7 WEEKDAY_APPR_PROCESS_START 0.004001745
## 8 NAME_FAMILY_STATUS -0.004127226
## 9 FLAG_OWN_REALTY -0.006148388
## 10 FLAG_OWN_CAR -0.021850938
# Reorder the levels of the categorical predictor variable
top_categorical_predictors$variable <- factor(top_categorical_predictors$variable,
levels = top_categorical_predictors$variable[order(top_categorical_predictors$correlation)])##Visualizing categorical predictors with Target
# Create a bar plot of the top categorical predictors
ggplot(top_categorical_predictors, aes(x = variable, y = correlation, label = round(correlation, 2))) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(vjust = -0.5) +
labs(x = "Categorical Predictor", y = "Correlation") +
ggtitle("Top Categorical Predictors with Target Variable") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_flip() +
scale_fill_manual(values = c("steelblue" = "steelblue"), guide = "none")
NAME_EDUCATION_TYPE, CODE_GENDER, NAME_INCOME_TYPE are the top 3
categorical predictors which has positive correlation wrt TARGET.
Join application_{train|test}.csv with transactional data in, for example, bureau.csv or previous_application.csv. This will require aggregating the transactional data to have the same grain as the application data.
We can average together a select number of columns to condense the bureau table down to the client level from the credit level, which will help us join it to our application data (also at the client level).
data2 <- group_by(data2,SK_ID_CURR) %>%
summarize(avg_days_credit = mean(DAYS_CREDIT),
avg_credit_day_overdue = mean(CREDIT_DAY_OVERDUE),
avg_days_credit_enddate = mean(DAYS_CREDIT_ENDDATE),
avg_amt_credit_max_overdue = mean(AMT_CREDIT_MAX_OVERDUE),
avg_cnt_credit_prolong = mean(CNT_CREDIT_PROLONG),
avg_amt_credit_sum = mean(AMT_CREDIT_SUM),
avg_amt_credit_sum_debt = mean(AMT_CREDIT_SUM_DEBT),
avg_amt_credit_sum_limit = mean(AMT_CREDIT_SUM_LIMIT),
avg_amt_credit_sum_overdue = mean(AMT_CREDIT_SUM_OVERDUE))
head(data2)## SK_ID_CURR avg_days_credit avg_credit_day_overdue avg_days_credit_enddate avg_amt_credit_max_overdue avg_cnt_credit_prolong avg_amt_credit_sum avg_amt_credit_sum_debt avg_amt_credit_sum_limit
## Min. :100001 Min. :-2922.0 Min. : 0.0000 Min. :-41875.0 Min. : 0 Min. :0.00000 Min. : 0 Min. :-1083615 Min. : -49014
## 1st Qu.:188879 1st Qu.:-1431.0 1st Qu.: 0.0000 1st Qu.: -718.0 1st Qu.: 0 1st Qu.:0.00000 1st Qu.: 103960 1st Qu.: 0 1st Qu.: 0
## Median :277895 Median :-1052.5 Median : 0.0000 Median : -137.7 Median : 0 Median :0.00000 Median : 197287 Median : 43190 Median : 0
## Mean :278047 Mean :-1083.8 Mean : 0.9659 Mean : 597.8 Mean : 7559 Mean :0.00701 Mean : 380721 Mean : 169017 Mean : 5167
## 3rd Qu.:367185 3rd Qu.: -670.4 3rd Qu.: 0.0000 3rd Qu.: 569.0 3rd Qu.: 2037 3rd Qu.:0.00000 3rd Qu.: 397831 3rd Qu.: 145816 3rd Qu.: 0
## Max. :456255 Max. : 0.0 Max. :2776.0000 Max. : 31198.0 Max. :47406123 Max. :6.00000 Max. :198072344 Max. :51750000 Max. :1755000
## NA's :76830 NA's :275677 NA's :13 NA's :126580 NA's :208930
## avg_amt_credit_sum_overdue
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 46
## 3rd Qu.: 0
## Max. :1617404
##
Explore the joined transactional data. Do some of the added columns show promise in predicting default?
skm <- skim(data2)
# Create DF showing Columns with missing data
skm_missing <- select(skm, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missingSome of these columns definitely show promise. There are 9 additional dimensions and only 5 of them have missing values. Of those 5 missing values, only 2 are missing more than half of their values. Only 2-3 of the 9 additional dimensions are missing substantial data.
Now that the data is joined, let’s see how many records we could add additional dimensions to.
skm2 <- skim(data3)
# Create DF showing Columns with missing data
skm_missing2 <- select(skm2, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missing2Avg_amt_credit_max_overdue only fills in about 9% of records and avg_amt_credit_sum_limit only matches about 28%. Those two variables may have to be thrown out, but all of the rest of them have matches of 50% or more. Many of the variables match a whopping 86% of records. This data is definitely usable and could be of value with the use of a model.
The dataset is highly dimensional and has 121 independent variables listed. It is important to understand the degree of completeness of these variables in order to determine their usefulness in the model building phase of the project.
skm <- skim(data)
# Create DF showing Columns with missing data
skm_missing <- select(skm, skim_variable, n_missing, complete_rate) %>% filter(n_missing > 0) %>% arrange(complete_rate)
skm_missingOf the 121 independent variables, it turns out half (61 rows) have missing data. Before determining what to do about each of these variables (removal, imputation, etc.) it is important to examine the columns and have a better understanding of what each of these columns may contribute in terms of useful information towards solving the eventual question of whether or not the customer is likely to default on the loan.
skm_char <- skm %>% filter(skim_type == 'character') %>% select(skim_variable,character.min,character.max,character.empty) %>% filter(character.empty > 0)
skm_charSix of the 16 character variables contain missing information. Of those 6 columns, only 2 contain information that is informative towards the likelihood of customer default, NAME_TYPE_SUITE (NA = 1292) and OCCUPATION_TYPE (NA=96391). In both of these cases it is impossible know whether the NA signifies the lack of a value (i.e. unhoused or unemployed) of if the information is just incomplete. Therefore it is proposed to fill all NAs in both of this columns with a new value ‘unknown’. The other 4 columns will be dropped in light of the large number of missing values, and the small amount of variance in the values, in addition to the lack of relevance to the target variable outcome.
Of the 61 variables that have missing data, 47 of these variables describe the housing of applicant in significant detail. While an understanding of the housing situation is likely to be informative of the customer’s default risk, the detail provided in these columns is probably not necessary, much of the important information is included in NAME_HOUSING_TYPE and FLAG_OWN_REALTY which are both complete. It may be informative to disaggregate those that live in a house as opposed to an apartment, and we can look further into this possibility. However, inclusion of these 47 variables in the final model is probably unnecessary and so missing values in these columns can be ignored.
housing_stats_cols <- c('APARTMENTS_AVG','BASEMENTAREA_AVG','YEARS_BEGINEXPLUATATION_AVG','YEARS_BUILD_AVG','COMMONAREA_AVG',
'ELEVATORS_AVG','ENTRANCES_AVG','FLOORSMAX_AVG','FLOORSMIN_AVG','LANDAREA_AVG','LIVINGAPARTMENTS_AVG',
'LIVINGAREA_AVG','NONLIVINGAPARTMENTS_AVG','NONLIVINGAREA_AVG','APARTMENTS_MODE','BASEMENTAREA_MODE',
'YEARS_BEGINEXPLUATATION_MODE','YEARS_BUILD_MODE','COMMONAREA_MODE','ELEVATORS_MODE','ENTRANCES_MODE',
'FLOORSMAX_MODE','FLOORSMIN_MODE','LANDAREA_MODE','LIVINGAPARTMENTS_MODE','LIVINGAREA_MODE',
'NONLIVINGAPARTMENTS_MODE','NONLIVINGAREA_MODE','APARTMENTS_MEDI','BASEMENTAREA_MEDI',
'YEARS_BEGINEXPLUATATION_MEDI','YEARS_BUILD_MEDI','COMMONAREA_MEDI','ELEVATORS_MEDI','ENTRANCES_MEDI',
'FLOORSMAX_MEDI','FLOORSMIN_MEDI','LANDAREA_MEDI','LIVINGAPARTMENTS_MEDI','LIVINGAREA_MEDI',
'NONLIVINGAPARTMENTS_MEDI','NONLIVINGAREA_MEDI','FONDKAPREMONT_MODE','HOUSETYPE_MODE','TOTALAREA_MODE',
'WALLSMATERIAL_MODE','EMERGENCYSTATE_MODE')
# Remove housing data columms
skm_missing %>% filter(!skim_variable %in% housing_stats_cols)Included in the dataset are 3 externally sourced scores which have been normalized. There is wide variation in the number of values missing in each of these sources (\(NA_1=173378\),\(NA_2=660\),\(NA_3=60965\)).
## EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3
## Min. :0.01 Min. :0.0000 Min. :0.00
## 1st Qu.:0.33 1st Qu.:0.3925 1st Qu.:0.37
## Median :0.51 Median :0.5660 Median :0.54
## Mean :0.50 Mean :0.5144 Mean :0.51
## 3rd Qu.:0.68 3rd Qu.:0.6636 3rd Qu.:0.67
## Max. :0.96 Max. :0.8550 Max. :0.90
## NA's :173378 NA's :660 NA's :60965
# Select rows with no External Source Data
n_empty_ext <- nrow(data %>% filter(is.na(EXT_SOURCE_1) & is.na(EXT_SOURCE_2) & is.na(EXT_SOURCE_3)))Removing rows with missing data would shrink the dataset considerably. Imputation with the mean value for each column is a possibility. Another option is to create a new variable combining the scores into an AVG_EXT_SCORE since they are all normalized values. The benefit to this approach is that only 172 rows have no value for any of the 3 columns and so very little data would need to be imputed or removed. This approach takes advantage of available information, while reducing missing data. However it is possible that some information loss will occur with this approach if one of these sources turns out to be much more informative than one or both of the others.
The final large group of related data deals with credit inquiries from the credit bureau over the previous year. These are divided into the hour, day, week, month, quarter, and year (each time period is exclusive of the time period before it) prior to the application being submitted. This data is strongly right-skewed and is 86.5% complete (NAs = 41519).
summary(data %>% select(AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,
AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR))## AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
## Min. :0.00 Min. :0.00 Min. :0.00 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0
## Median :0.00 Median :0.00 Median :0.00 Median : 0.00 Median : 0.00 Median : 1.0
## Mean :0.01 Mean :0.01 Mean :0.03 Mean : 0.27 Mean : 0.27 Mean : 1.9
## 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 3.0
## Max. :4.00 Max. :9.00 Max. :8.00 Max. :27.00 Max. :261.00 Max. :25.0
## NA's :41519 NA's :41519 NA's :41519 NA's :41519 NA's :41519 NA's :41519
While we could remove the 41519 rows missing data, the loss of other information in those records might be valuable, and so it is proposed to impute the missing values with the median value for each column, in most cases this is 0.
There are a few remaining columns that contain missing data OWN_CAR_AGE, AMT_GOODS_PRICE, AMT_ANNUITY, CNT_FAM_MEMBERS,and DAYS_LAST_PHONE_CHANGE. Of these columns only OWN_CAR_AGE has a significant number of missing values. There is another column which is related to car ownership, FLAG_OWN_CAR, which is a boolean field. Upon closer inspection it is evident that all of those records where FLAG_OWN_CAR = ‘N’ contain NAs, conversely only 5 records with a FLAG_OWN_CAR = ‘Y’ contain missing values.
CrossTable(data$FLAG_OWN_CAR, is.na(data$OWN_CAR_AGE), prop.r = F, prop.c = F, prop.chisq = F, dnn = c('Owns Car','Contains Missing Values'))##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 307511
##
##
## | Contains Missing Values
## Owns Car | FALSE | TRUE | Row Total |
## -------------|-----------|-----------|-----------|
## N | 0 | 202924 | 202924 |
## | 0.000 | 0.660 | |
## -------------|-----------|-----------|-----------|
## Y | 104582 | 5 | 104587 |
## | 0.340 | 0.000 | |
## -------------|-----------|-----------|-----------|
## Column Total | 104582 | 202929 | 307511 |
## -------------|-----------|-----------|-----------|
##
##
Therefore, it is proposed to remove these 5 records, and to convert the OWN_CAR_AGE into a factor variable with levels: No Car Owned, 0-3 years, 4-6 years, etc.
The other columns contain so few observations that it is proposed that those rows with missing values be excluded.
skm_num <- skm %>% filter(skim_type == 'numeric') %>% filter(!skim_variable %in% housing_stats_cols) %>% select(skim_variable, numeric.p0, numeric.mean, numeric.p100, numeric.hist)
skm_numSome modelling techniques are susceptible to outsized influence by outliers (regression for example). After examining the distribution of the numeric data a few columns stand out as containing likely outliers. The first examples are CNT_FAM_MEMBERS and CNT_CHILDREN.
# Create dataframe with columns of concern for plotting
num_problem_df <- data %>% select(CNT_CHILDREN, CNT_FAM_MEMBERS, AMT_INCOME_TOTAL, OWN_CAR_AGE, DAYS_LAST_PHONE_CHANGE, DAYS_EMPLOYED) %>% filter(!is.na(CNT_FAM_MEMBERS))
# Create Plots for family columns
plt_fam <- ggplot(num_problem_df, aes(CNT_FAM_MEMBERS)) +
geom_boxplot()
plt_chld <- ggplot(num_problem_df, aes(CNT_CHILDREN)) +
geom_boxplot()
grid.arrange(plt_fam, plt_chld, ncol=2, top=textGrob('Family Count Variables Distribution'))Both of these columns (which are related) contain a small number of outlier records, however the information is consistent and is likely to have an influence on the target variable. As such, it is recommended that this variable is normalized before it is used in any modelling techniques that are suceptible to the presence of outliers.
In the case of AMT_INCOME_TOTAL there is a single outlier that lies so far out of the scale (AMT_INCOME_TOTAL = 117,000,000) and is inconsistent with other data, and so is probably erroneous. Even if not erroneous, the value added by this datapoint to any model is negligible as the core customer group is not multi-millionaires.
plt_inc <- ggplot(num_problem_df, aes(AMT_INCOME_TOTAL)) +
geom_boxplot() +
ggtitle('Total Income Distribution')
# data %>% filter(AMT_INCOME_TOTAL > 100000000)
plt_inc2 <- ggplot(data %>% filter(AMT_INCOME_TOTAL < 100000000), aes(AMT_INCOME_TOTAL)) +
geom_boxplot() +
ggtitle('Total Income Distribution (After Removal)')
grid.arrange(plt_inc,plt_inc2, ncol=2, top=textGrob('Total Income Variable Distribution'))
Even after removal, outliers remain and measures (see above) will have
to be taken to ensure they do not overly influence predictive models. In
the case of such strongly skewed data, it might be advisable to convert
the continuous variable into a categorical one and break the data into
bins for use in models.
ggplot(num_problem_df, aes(OWN_CAR_AGE)) +
geom_histogram(bins = 60) +
ggtitle('Age of Cars Owned Distribution')data %>% filter(OWN_CAR_AGE > 55) %>% select(SK_ID_CURR, OWN_CAR_AGE) %>% group_by('Cars older than 55 years' = as.factor(OWN_CAR_AGE))%>%count()There is an odd anomaly in the data such that a large number of owned cars are recorded as being 64-65 years old. It seems unlikely that this data correctly represents reality. The amount of erroneous data is small and so removal wouldn’t impact the overall dataset to a large degree. However, the other data contained in these records may be informative, and so it is proposed that if the variable OWN_CAR_AGE is to be used in a model, that these rows be removed.
ggplot(num_problem_df, aes(DAYS_LAST_PHONE_CHANGE)) +
geom_histogram() +
ggtitle('Days Since Previous Phone Change')data %>% filter(DAYS_EMPLOYED > 0) %>% select(SK_ID_CURR, DAYS_EMPLOYED) %>% group_by('Days Employed' = as.factor(DAYS_EMPLOYED))%>%count()ggplot(num_problem_df %>% filter(DAYS_EMPLOYED < 0), aes(DAYS_EMPLOYED)) +
geom_histogram() +
ggtitle('Days Employed')Finally, DAYS_EMPLOYED has significant data integrity issues. The way the data is set up only values less than or equal to 0 should be permissible. However, a significant number of rows (n = 55374) all have the same positive value, DAYS_EMPLOYED = 365243. Not only should the value be negative, but the scale is also outside of realistic values (365243 days = 1000+ years). Due to these rows comprising a significant portion of the data set (18%) removal would be detrimental to model accuracy, so it is proposed that the data be imputed for these rows, being replaced by the column median (median = -1648)
app_train %>%
mutate(n_missing = rowSums(is.na(.)),
p_missing = n_missing/ncol(.)) %>%
ggplot() +
geom_histogram(aes(p_missing),
binwidth = 0.05, fill = "darkred", color = "white") +
stat_bin(aes(p_missing, y = after_stat(count), label = ifelse(after_stat(count) == 0, "", after_stat(count))),
geom = "text", binwidth = 0.05, size = 6, fontface = "bold", vjust = 0) +
scale_x_continuous(breaks = seq(0,1,0.1), minor_breaks = NULL) +
scale_y_continuous(labels = ~paste0(.%/%1000, "k")) +
labs(title = "Distribution of missing values by row",
x = "percent missing") +
theme_minimal()There are many categorical variables in this table which are encoded numerically. With 122 total columns, it’s rather arduous to distinguish true numerical variables but selecting columns that only contain numbers (and NA’s) is an obvious start.
# Ensuring data types were properly attributed during import
app_train %>%
summarise(across(everything(),
list(assigned = class,
observed = \(xx){
case_when(grepl("^(\\d|\\.|-)+$", xx) | is.na(xx) ~ "numeric",
.default = "character")
},
vsample = ~as.character(.))) %>%
first) %>%
pivot_longer(everything(),
names_to = c("column", ".value"),
names_pattern = "(.*)_(.*)") %>%
filter(assigned != observed) # subset mismatchesSince no columns where misclassed during the importing process, we can move on to subsetting the numerical columns.
7.3 Social Circle Data
Looking at the remaining columns with missing data 4 of them contain information about the applicant’s social circle and the 30-day and 60-day days past due default status, there is minimal missing data (n=1021)
Given that there are relatively few rows with missing data and the data is heavily right-skewed, it is proposed that missing data in these columns be replaced with imputed zero values (median= 0),